HIGH-LEVEL FLOW OF PROJECT

  1. Clean Data
  2. Feature Engineering & Exploratory Analysis:
    • Add conversion rate
    • Add rolling average and rolling std columns
    • Decompose the trend, seasonality and residual of the conversion rate
    • Transform conversion rate from non-stationary to stationary
  3. Model Selection Explanation
  4. Train ARMA model using the best order to get minimum AIC
  5. Make predictions based on ARMA model
  6. Find anomalies using a threshold based on the mean and standard deviation of the squared errors
  7. Visualize anomalies/unusual changes
  8. Display the estimated before and after days of the changes
  9. Calculate confidence intervals of estimated conversion rates on days of change
  10. Calculate confidence intervals of estimated conversion rates before and after days of change

Load libraries and data

Cleaning data

There are no missing or duplicated values.

Feature Engineering

Compute the 7-day moving average and standard deviation to extract the overall trend and seasonality
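As a minimal sketch of this step, the conversion rate and its 7-day rolling statistics could be computed with pandas as shown here. The data is synthetic, and the column names (`visits`, `conversions`, `conversion_rate`, `rolling_mean_7d`, `rolling_std_7d`) are assumptions for illustration, not the project's actual schema.

```python
import numpy as np
import pandas as pd

# Synthetic daily data; column names are hypothetical.
rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-01", periods=60, freq="D")
df = pd.DataFrame(
    {
        "visits": rng.integers(800, 1200, size=60),
        "conversions": rng.integers(200, 400, size=60),
    },
    index=dates,
)

# Conversion rate per day.
df["conversion_rate"] = df["conversions"] / df["visits"]

# 7-day rolling mean and standard deviation.
# The first 6 rows are NaN because a full 7-day window is not yet available.
df["rolling_mean_7d"] = df["conversion_rate"].rolling(window=7).mean()
df["rolling_std_7d"] = df["conversion_rate"].rolling(window=7).std()
```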

Exploratory Analysis

The rolling mean decreases over time, which suggests that the time series is not stationary.

We can check whether the data is stationary or not using the Augmented Dickey-Fuller (ADF) test.

The ADF test returns several statistics, including a p-value. The null hypothesis of the test is that the series has a unit root, i.e. that it is non-stationary. If the p-value is less than 0.05, we reject the null hypothesis and treat the data as stationary; if the p-value is greater than 0.05, we cannot reject it and treat the data as non-stationary. Here 0.05 is the significance level, which corresponds to a 95% confidence level. You can also try other significance levels to test stationarity.

Decomposing the trend, seasonality and residual for conversion rate

Check if conversion rate is stationary

Convert the conversion rate variable to a stationary series for time series modeling
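A minimal sketch of first-order differencing with pandas, on a synthetic series with a linear downward trend. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic series that falls by 0.01 per day (clearly non-stationary).
dates = pd.date_range("2021-01-01", periods=10, freq="D")
conversion_rate = pd.Series(np.linspace(0.40, 0.31, 10), index=dates)

# First difference: X_t - X_{t-1}. The first value is NaN and is dropped.
stationary = conversion_rate.diff().dropna()
print(stationary.head())
```

After differencing, the linear trend becomes a constant level, which is why the differenced series can pass a stationarity test when the original does not.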

Model Selection

After differencing, the data has become stationary. Now we can use this data to train the time series model.

Time-series data is strictly sequential and autocorrelated. We can't use least-squares linear regression directly here because the residuals would be autocorrelated, which can result in spurious predictions. That is why we use a model designed specifically for time-series data.

Here, we are going to use the ARMA (AutoRegressive Moving Average) model to forecast the data. The ARMA model has two parameters, p and q: p is the order of the AR (autoregressive) part and q is the order of the MA (moving average) part. The autoregressive part uses the previous lags to model the data, and the moving average part uses the previous forecast errors. We can combine these two complementary views of modeling X_t into a single equation, giving rise to the ARMA(p,q) model.
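In a common textbook notation, the ARMA(p,q) model can be written as follows. Assigning \phi to the AR coefficients and \theta to the MA coefficients is the usual convention and is assumed here; c is a constant and \varepsilon_t is the forecast error (white noise) at time t.

```latex
X_t = c + \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}
```

For example, ARMA(0,1) reduces to X_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1}.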

Using historical data, we can select p and q in the ARMA(p,q) model and learn the model coefficients {\theta} and {\phi}, based on which we can make a future prediction. A substantial deviation from the prediction results in a time-series anomaly. We can define substantial as, for example, two standard deviations from a moving average of X. The model parameters {\theta} and {\phi} are learned via Maximum Likelihood Estimation (MLE) as implemented in the statsmodels package.

The conversion rate variable is non-seasonal, so we use ARMA rather than SARMA.

Optimize AIC to get the best performed ARMA model

The Akaike Information Criterion (AIC) is a widely used measure for comparing statistical models. It combines 1) the goodness of fit and 2) the simplicity/parsimony of the model into a single statistic. The model with the lower AIC is generally “better”.

To find the best order (p, q) for the model, we select the order that minimizes the AIC. I created a function that traverses different values of (p, q) and compares their AIC values against each other; the result gives us the order (p, q) with the minimum AIC.

ARMA Time Series Prediction

Now that we have found that (0, 1) is the best order, we can use it to fit the ARMA model.

Train ARMA model on stationary series

Interpret the model

The P>|z| column gives the p-values of the coefficients. Both are less than 0.05, so we can conclude that the coefficients are statistically significant.

The AIC is at its minimum among the orders searched, since we chose the order (p,q) to minimize it.

Residuals

The residual is the difference between the actual conversion rate and the values fitted by the ARMA time series model.

The residuals are approximately normally distributed, with a bell-shaped curve.

Detect Outliers

Create a find-anomalies function based on a threshold. Here, the threshold is defined as the mean of the squared errors plus one standard deviation of the squared errors from the ARMA time series model. A day whose squared error exceeds this threshold is flagged as an unusual spike or drop in the conversion rate.

We can calculate the threshold differently based on discussions with stakeholders and on business goals.
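The thresholding rule can be sketched as follows. The function name `find_anomalies`, its signature, and the toy data are assumptions; the `n_std` parameter is included so the threshold can be tightened or loosened.

```python
import pandas as pd

def find_anomalies(actual, predicted, n_std=1.0):
    """Hypothetical helper: flag points whose squared error exceeds
    mean(squared errors) + n_std * std(squared errors)."""
    squared_errors = (actual - predicted) ** 2
    threshold = squared_errors.mean() + n_std * squared_errors.std()
    return squared_errors > threshold, threshold

# Toy example: the last day deviates strongly from the prediction.
dates = pd.date_range("2021-01-01", periods=10, freq="D")
actual = pd.Series([0.30] * 9 + [0.60], index=dates)
predicted = pd.Series([0.30] * 10, index=dates)

flags, threshold = find_anomalies(actual, predicted)
print(actual[flags])  # only the last day is flagged
```

Because the errors are squared, the rule treats spikes and drops symmetrically; the sign of `actual - predicted` tells them apart.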

Visualize anomalies/unusual changes in conversion rate

Unusual changes in conversion rate along with their upper and lower CI bounds

The table above shows the estimated conversion rate(s) and the estimated days of change.

Unusual Drops in Conversion Rate

Based on the graphs, the estimated times and conversion rates of the drops, along with their CI bounds, are:

Changes and their before and after records

Highlight before and after records of the changes

Calculate Confidence Interval

Confidence intervals around the estimated conversion rate around the days of change

Confidence Interval is calculated according to this formula:

CI = mean +/- t*(s/sqrt(n))

Confidence Level: 95% , alpha = 5%

mean: mean of all the estimates

t_0.025,15 = 2.131

s: standard deviation of residuals

n: sample size of all the estimates

df: degrees of freedom = n - 1 = 15

t-distribution table: https://www.sjsu.edu/faculty/gerstman/StatPrimer/t-table.pdf
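Instead of reading the critical value from a table, it can be computed with `scipy.stats.t.ppf`. The sketch below implements the formula above; the function name `t_confidence_interval` and the example inputs (mean 0.4, s 0.1) are illustrative assumptions, not the project's values.

```python
import numpy as np
from scipy import stats

def t_confidence_interval(mean, s, n, confidence=0.95):
    """Hypothetical helper: CI = mean +/- t * (s / sqrt(n)) with df = n - 1."""
    alpha = 1 - confidence
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # two-sided critical value
    margin = t_crit * s / np.sqrt(n)
    return mean - margin, mean + margin

# Critical value for 95% confidence and df = 15.
t_crit_15 = stats.t.ppf(0.975, df=15)
print(round(t_crit_15, 3))

# Illustrative inputs only.
lower, upper = t_confidence_interval(mean=0.4, s=0.1, n=16)
print(lower, upper)
```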

We are 95% confident that the population mean conversion rate on the days of change is between 0.3 and 0.477.

Confidence intervals around the estimated conversion rate before and after days of change

We are 95% confident that the population mean conversion rate on the days before and after the change is between 0.247 and 0.31.